Abstract: We consider the problem of completing a matrix with categorical-valued entries from partial observations. This is achieved by extending the formulation and theory of one-bit matrix completion [1]. We recover a low-rank matrix $X$ by maximizing the likelihood ratio with a constraint on the nuclear norm of $X$, and the observations are mapped from entries of $X$ through multiple link functions. We establish theoretical upper and lower bounds on the recovery error, which meet up to a constant factor $O(K^{3/2})$, where $K$ is the fixed number of categories. The upper bound in our case depends on the number of categories implicitly through a maximization of terms that involve the smoothness of the link functions. In contrast to one-bit matrix completion, our bounds for categorical matrix completion are optimal up to a factor on the order of the square root of the number of categories, which is consistent with the intuition that the problem becomes harder when the number of categories increases. By comparing the performance of our method with the conventional matrix completion method on the MovieLens dataset, we demonstrate the advantage of our method.

I. Introduction 

Recovering a low-rank matrix $M$ from a subset of its entries is a fundamental problem that arises in many real-world applications. The so-called matrix completion problem was originally formulated as estimating a matrix $M$ with real-valued entries, subject to a data-fit constraint. However, in many problems the entries are categorical (e.g., in recommender systems the ratings take integer values 1 to 5, and in health care applications the results are positive, negative, or uncertain). A better formulation for these scenarios is categorical matrix completion.

In this paper, we consider categorical matrix completion by extending the formulation of one-bit matrix completion [1] to deal with categorical entries, and we adopt its proof techniques to obtain upper and lower bounds. Assume the input variables form a low-rank matrix $X$, and we observe partial entries of a matrix which are categorical responses of the underlying low-rank matrix. A new problem that arises in the categorical setting is how to choose appropriate link functions $f_k$ that map entries of $X$ to entries of the observed matrix $M$. We consider multinomial logistic regression link functions, which are smooth and easy to construct for an arbitrary number of categories. We consider a nuclear norm regularized maximum likelihood estimator with a likelihood function for the categorical distribution (different from the Bernoulli distribution used in the one-bit case). To obtain theoretical upper and lower bounds, we introduce new conditions that take into account the characteristics of the categorical distribution. Our upper and lower bounds match up to a factor that is on the order of the square root of the number of categories. Finally, we compare the performance of our method with the conventional matrix completion method on the MovieLens dataset.

As mentioned, a closely related work is one-bit matrix completion [1], where the matrix entries are binary valued; therein the authors establish theoretical upper and lower bounds for the mean squared error of the recovered matrix, which demonstrate the optimality of the estimator. Recently, [?] considers matrix completion over a finite alphabet with nuclear norm regularization, under a more general sampling model that only requires knowledge of an upper bound for the entries of the matrix; a theoretical upper bound is given therein, which has a faster convergence rate than that in [1], but there is no theoretical lower bound. A more recent work [?] provides a lower bound for the special case when $K = 2$. Other related work on matrix completion with quantized entries or Poisson observations includes [?].

The rest of this paper is organized as follows. Section II sets up the formalism for categorical matrix completion and the nuclear norm regularized maximum likelihood estimator. Section III establishes the upper and lower bounds for the recovery error. Section IV presents a numerical example using the MovieLens dataset to demonstrate the performance of our method. All proofs are delegated to the appendix.^1

The notation in this paper is standard. In particular, $[d] = \{1, 2, \ldots, d\}$; $\mathbb{I}[\mathcal{E}]$ is the indicator function for an event $\mathcal{E}$; $|A|$ denotes the number of elements in a set $A$. Let entries of a matrix $M$ be denoted by $M_{ij}$. Let $\|M\|$ be the spectral norm, which is the largest absolute singular value; $\|M\|_F = \sqrt{\sum_{i,j} M_{ij}^2}$ be the Frobenius norm; $\|M\|_*$ be the nuclear norm, which is the sum of the singular values; and finally $\|M\|_\infty = \max_{i,j} |M_{ij}|$ be the infinity norm. Let $\operatorname{rank}(M)$ denote the rank of a matrix $M$. The inner product of two matrices $M_1$ and $M_2$ is denoted by $\langle M_1, M_2 \rangle = \operatorname{tr}(M_1^T M_2)$. Given the number $K$ of categories and any set $\{a_1, a_2, \ldots, a_K\}$, we say that a random variable $X$ follows the categorical distribution with parameters $(p_1, p_2, \ldots, p_K)$ if $\sum_{k=1}^{K} p_k = 1$ and $P(X = a_k) = p_k$ for all $k \in [K]$. Also define the Kullback-Leibler (KL) divergence between two categorical distributions with parameters $(p_1, \ldots, p_K)$ and $(q_1, \ldots, q_K)$ as

$$D\big((p_1, p_2, \ldots, p_K) \,\|\, (q_1, q_2, \ldots, q_K)\big) = \sum_{k=1}^{K} p_k \log \frac{p_k}{q_k},$$


^1 Full version of the paper can be downloaded from www2.isye.gatech.edu/~yxie77/Categorical-MC-CAMSAP.pdf



and define their Hellinger distance as

$$d_H^2\big((p_1, p_2, \ldots, p_K), (q_1, q_2, \ldots, q_K)\big) = \sum_{k=1}^{K} \big(\sqrt{p_k} - \sqrt{q_k}\big)^2.$$
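The two divergences defined above are easy to evaluate numerically. The following is a minimal sketch, assuming the parameters are passed as plain Python lists of probabilities; the function names `kl_divergence` and `hellinger_sq` are illustrative and not part of the paper.

```python
import math

def kl_divergence(p, q):
    """D(p || q) = sum_k p_k * log(p_k / q_k) for categorical parameters."""
    # Terms with p_k = 0 contribute nothing (0 * log 0 is taken as 0).
    return sum(pk * math.log(pk / qk) for pk, qk in zip(p, q) if pk > 0)

def hellinger_sq(p, q):
    """Squared Hellinger distance: sum_k (sqrt(p_k) - sqrt(q_k))^2."""
    return sum((math.sqrt(pk) - math.sqrt(qk)) ** 2 for pk, qk in zip(p, q))

# Two hypothetical categorical distributions over K = 3 categories.
p = [0.2, 0.5, 0.3]
q = [0.3, 0.4, 0.3]
print(kl_divergence(p, q), hellinger_sq(p, q))
```

For these inputs the squared Hellinger distance does not exceed the KL divergence, which is the relation between the two quantities used later in the proof of the upper bound.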

II. Formulation 

Suppose we make noisy observations of a matrix $M \in \mathbb{R}^{d_1 \times d_2}$ on an index set $\Omega \subset [d_1] \times [d_2]$. The indices are randomly selected with $E|\Omega| = m$, or, equivalently, the indicator functions $\mathbb{I}[(i,j) \in \Omega]$ are i.i.d. Bernoulli random variables with parameter $m/(d_1 d_2)$. Assume that the observed entries take one of $K$ possible values $\{a_1, a_2, \ldots, a_K\}$. Given a set of differentiable link functions $f_k$, $k = 1, \ldots, K$, that satisfy $\sum_{k=1}^{K} f_k(x) = 1$, the noisy observations follow the categorical distribution:

$$Y_{ij} = a_k \ \text{with probability}\ f_k(M_{ij}), \quad \text{for } (i,j) \in \Omega. \tag{1}$$

Our goal is to recover $M$ from the categorical observations $\{Y_{ij}\}_{(i,j) \in \Omega}$, and then to fill in the missing entries using the link functions and the entries of the recovered matrix $\hat{M}$. This is done by letting $Y_{ij} = a_{k^*}$ for all $(i,j) \in ([d_1] \times [d_2]) \setminus \Omega$, where $k^* = \arg\max_{1 \le k \le K} f_k(\hat{M}_{ij})$.
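The observation model and the fill-in rule can be sketched in a few lines. This is an illustration under stated assumptions: `toy_link` is a hypothetical three-category link chosen only so that the probabilities sum to one, and the helper names are not from the paper.

```python
import math
import random

def toy_link(x):
    """Hypothetical link: returns (f_1(x), f_2(x), f_3(x)), which sum to 1."""
    w = [math.exp(-x), 1.0, math.exp(x)]
    total = sum(w)
    return [wk / total for wk in w]

def sample_observations(M, link, m, values, rng):
    """Observe each index with probability m/(d1*d2); draw Y_ij from the
    categorical distribution with parameters f_1(M_ij), ..., f_K(M_ij)."""
    d1, d2 = len(M), len(M[0])
    keep = m / (d1 * d2)
    Y = {}
    for i in range(d1):
        for j in range(d2):
            if rng.random() < keep:
                Y[(i, j)] = rng.choices(values, weights=link(M[i][j]))[0]
    return Y

def predict(M_hat, link, values):
    """Fill every entry with a_{k*}, where k* maximizes f_k(M_hat_ij)."""
    K = len(values)
    return [[values[max(range(K), key=lambda k: link(x)[k])] for x in row]
            for row in M_hat]

M = [[2.0, -2.0], [1.5, -1.5]]
Y = sample_observations(M, toy_link, m=4, values=[1, 2, 3], rng=random.Random(0))
filled = predict(M, toy_link, values=[1, 2, 3])
```

Here `predict` is applied to the true matrix only to show the argmax rule; in practice it would be applied to the estimate obtained from the observed entries.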

The following are two simple illustrative examples of link functions. In a $K$-categorical recommender system, there are $K$ possible ratings, and the matrix $M$ with entries $M_{ij} \in [K]$ is the true rating matrix of $d_1$ users for $d_2$ items. Suppose users can be in three possible moods: good, normal, and bad. The link function characterizes the bias of a user, and we can observe a subset of biased ratings. Suppose a user tends to rate an item one category lower than the truth in a bad mood and one category higher than the truth in a good mood, with the probabilities of being in a bad, normal, and good mood being 0.2, 0.6, and 0.2, respectively. Then the link functions are given by

$$\begin{cases}
f_1(1) = 0.8;\ f_1(2) = 0.2;\ f_1(x) = 0 \text{ otherwise}; \\
f_k(k-1) = 0.2;\ f_k(k) = 0.6;\ f_k(k+1) = 0.2;\ f_k(x) = 0 \text{ otherwise},\ k = 2, \ldots, K-1; \\
f_K(K-1) = 0.2;\ f_K(K) = 0.8;\ f_K(x) = 0 \text{ otherwise}.
\end{cases}$$
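The mood-based link functions can be checked mechanically. The sketch below is one way to encode them, assuming integer ratings 1 to K with K at least 2; the function name `mood_link` and its keyword arguments are illustrative.

```python
def mood_link(k, x, K, p_bad=0.2, p_normal=0.6, p_good=0.2):
    """f_k(x): probability that the observed rating is k when the true rating is x."""
    if k == x:
        stay = p_normal
        if x == 1:      # a bad mood cannot push the rating below category 1
            stay += p_bad
        if x == K:      # a good mood cannot push the rating above category K
            stay += p_good
        return stay
    if k == x - 1:      # bad mood: one category lower than the truth
        return p_bad
    if k == x + 1:      # good mood: one category higher than the truth
        return p_good
    return 0.0

K = 5
# For every true rating x, the values f_1(x), ..., f_K(x) form a distribution.
column_sums = [sum(mood_link(k, x, K) for k in range(1, K + 1)) for x in range(1, K + 1)]
```

Each entry of `column_sums` equals one up to floating-point rounding, which is the normalization required of the link functions.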

The second example is the widely used proportional-odds 
cumulative logit model, or multinomial logistic model 
where 

/fc(a;) cx k = [K], (2) 

and $\sum_{k=1}^{K} f_k(x) = 1$. Here $u_k$ and $p_k$ are parameters of the model that are given (or obtained from a training stage).
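As an illustration of a multinomial logistic link, the sketch below uses one common parametrization, $f_k(x) = \exp(u_k + p_k x) / \sum_j \exp(u_j + p_j x)$. This specific form, and the reading of `u` as intercepts and `p` as slopes, is an assumption made for the example and may differ from the exact expression in (2).

```python
import math

def multinomial_logit_link(x, u, p):
    """Return [f_1(x), ..., f_K(x)] for an assumed softmax parametrization."""
    scores = [uk + pk * x for uk, pk in zip(u, p)]
    top = max(scores)                       # subtract the max for numerical stability
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    return [w / total for w in weights]

# Hypothetical parameters for K = 3 categories.
u = [0.0, 0.5, 0.0]
p = [-1.0, 0.0, 1.0]
probs_low = multinomial_logit_link(-4.0, u, p)
probs_high = multinomial_logit_link(4.0, u, p)
```

With slopes that increase in $k$, small inputs favor the first category and large inputs favor the last one, so the link functions overlap only moderately.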

In addition, we make assumptions on the matrix $M$ to be recovered. First, we assume an upper bound $\alpha$ for $\|M\|_\infty$ to ensure that the recovery problem is well posed. Second, similarly to conventional matrix completion, we assume that the nuclear norm of the matrix is bounded: $\|M\|_* \le \alpha\sqrt{r d_1 d_2}$. This assumption can be viewed as a relaxation of $\|M\|_\infty \le \alpha$ and $\operatorname{rank}(M) \le r$ [1], since $\|M\|_* \le \sqrt{\operatorname{rank}(M)}\,\|M\|_F$ and $\|M\|_F \le \sqrt{d_1 d_2}\,\|M\|_\infty$ lead to $\|M\|_* \le \alpha\sqrt{r d_1 d_2}$.
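The chain of norm inequalities behind the nuclear norm assumption can be verified on a random low-rank matrix. This is a small numerical sketch using NumPy; the dimensions, rank, and seed are arbitrary choices for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
d1, d2, r, alpha = 30, 20, 3, 1.0

# Random rank-r matrix, rescaled so that its largest absolute entry equals alpha.
M = rng.standard_normal((d1, r)) @ rng.standard_normal((r, d2))
M *= alpha / np.abs(M).max()

nuclear = np.linalg.norm(M, "nuc")      # sum of singular values
frobenius = np.linalg.norm(M, "fro")
rank = np.linalg.matrix_rank(M)

print(nuclear, np.sqrt(rank) * frobenius, alpha * np.sqrt(r * d1 * d2))
```

The three printed numbers appear in increasing order, matching $\|M\|_* \le \sqrt{\operatorname{rank}(M)}\,\|M\|_F \le \alpha\sqrt{r d_1 d_2}$.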

To estimate $M$, we consider the following nuclear norm regularized maximum log-likelihood formulation. In our case, the log-likelihood function is given by

$$F_{\Omega,Y}(X) \triangleq \sum_{(i,j) \in \Omega} \sum_{k=1}^{K} \mathbb{I}[Y_{ij} = a_k] \log\big(f_k(X_{ij})\big).$$

Based on the assumptions above, we consider a set $S$ of candidate estimators:

$$S \triangleq \big\{ X \in \mathbb{R}^{d_1 \times d_2} : \|X\|_* \le \alpha\sqrt{r d_1 d_2},\ -\alpha \le X_{ij} \le \alpha,\ \forall (i,j) \in [d_1] \times [d_2] \big\}, \tag{3}$$

and recover $M$ by solving the following optimization problem:

$$\hat{M} = \arg\max_{X \in S} F_{\Omega,Y}(X). \tag{4}$$

This problem is convex, and it can be solved exactly by the interior-point method [12] or approximately by the efficient singular value thresholding method [13].
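To make the objective concrete, the sketch below evaluates the log-likelihood for a candidate matrix and checks the entrywise part of the constraint set. It is a minimal illustration, not a solver: `toy_link` is a hypothetical three-category link, observations are stored as a dictionary mapping index pairs to zero-based category indices, and the nuclear norm constraint is not checked here.

```python
import math

def toy_link(x):
    """Hypothetical link: returns [f_1(x), f_2(x), f_3(x)], which sum to 1."""
    w = [math.exp(-x), 1.0, math.exp(x)]
    total = sum(w)
    return [wk / total for wk in w]

def log_likelihood(X, Y, link):
    """Sum over observed (i, j) of log f_k(X_ij), where k is the observed category index."""
    return sum(math.log(link(X[i][j])[k]) for (i, j), k in Y.items())

def in_entry_box(X, alpha):
    """Check the entrywise constraint -alpha <= X_ij <= alpha."""
    return all(abs(x) <= alpha for row in X for x in row)

# Observed categories (zero-based): entry (0,0) is in the top category, (0,1) in the bottom one.
Y = {(0, 0): 2, (0, 1): 0}
X_good = [[2.0, -2.0]]
X_bad = [[-2.0, 2.0]]
print(log_likelihood(X_good, Y, toy_link), log_likelihood(X_bad, Y, toy_link))
```

A candidate whose entries agree in sign with the observed categories attains a larger log-likelihood than its negation, which is the behavior the maximization in (4) exploits.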


III. Performance bounds 


To establish our performance bounds, we make the following assumptions on the link functions $f_k$. Define, for any $k \in [K]$ and $x \in [-\alpha, \alpha]$,

$$L_\alpha^{(k)} \triangleq \sup_{|x| \le \alpha} \frac{|f_k'(x)|}{f_k(x)}, \qquad \beta_\alpha(x) \triangleq \max_{1 \le k \le K} \frac{(f_k'(x))^2}{f_k(x)}.$$

Assume that (1) there exists a positive constant $L_\alpha$ such that

$$\max_{1 \le k \le K} L_\alpha^{(k)} \le L_\alpha. \tag{5}$$

The interpretation of this assumption is that the function $f_k(x)$ does not change sharply when it is near the boundaries of the region; and (2) there exist two positive constants $\beta_\alpha^-$ and $\beta_\alpha^+$ such that

$$\inf_{|x| \le \alpha} \beta_\alpha(x) \ge \beta_\alpha^- \quad \text{and} \quad \sup_{|x| \le \alpha} \beta_\alpha(x) \le \beta_\alpha^+. \tag{6}$$

This lower bound for $\beta_\alpha(x)$ means that for every fixed $x \in [-\alpha, \alpha]$, there exists at least one $k \in [K]$ such that $f_k$ does not change too slowly. Another interpretation of the assumption on the lower bound is that the $f_k$'s overlap moderately so that we may determine the category uniquely for a given $x \in [-\alpha, \alpha]$. The interpretation of the upper bound for $\beta_\alpha(x)$ is similar to that for the upper bound $L_\alpha^{(k)}$. When $K = 2$, these assumptions coincide with those in [1].
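The constants in (5) and (6) can be approximated on a grid for a given family of link functions. The sketch below does this for an assumed softmax link $f_k(x) \propto \exp(u_k + p_k x)$, whose derivative is $f_k'(x) = f_k(x)\,(p_k - \sum_j p_j f_j(x))$; the parametrization, the grid size, and all function names are assumptions made for illustration.

```python
import math

def softmax_link(x, u, p):
    scores = [uk + pk * x for uk, pk in zip(u, p)]
    top = max(scores)
    w = [math.exp(s - top) for s in scores]
    total = sum(w)
    return [wk / total for wk in w]

def link_derivative(x, u, p):
    """f_k'(x) = f_k(x) * (p_k - sum_j p_j f_j(x)) for the softmax link."""
    f = softmax_link(x, u, p)
    mean_slope = sum(pj * fj for pj, fj in zip(p, f))
    return [fk * (pk - mean_slope) for fk, pk in zip(f, p)]

def smoothness_constants(u, p, alpha, grid=2001):
    """Grid approximations of L_alpha, beta_alpha^- and beta_alpha^+ on [-alpha, alpha]."""
    L, beta_minus, beta_plus = 0.0, math.inf, 0.0
    for t in range(grid):
        x = -alpha + 2.0 * alpha * t / (grid - 1)
        f = softmax_link(x, u, p)
        df = link_derivative(x, u, p)
        L = max(L, max(abs(d) / fk for d, fk in zip(df, f)))
        beta_x = max(d * d / fk for d, fk in zip(df, f))   # beta_alpha(x)
        beta_minus = min(beta_minus, beta_x)
        beta_plus = max(beta_plus, beta_x)
    return L, beta_minus, beta_plus

L_alpha, beta_minus, beta_plus = smoothness_constants([0.0, 0.0, 0.0], [-1.0, 0.0, 1.0], alpha=2.0)
print(L_alpha, beta_minus, beta_plus)
```

Because a grid only samples the interval, the returned values are approximations: the true supremum can be slightly larger and the true infimum slightly smaller.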

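Many link functions satisfy the previous two assumptions, including the widely used multinomial logistic model [?]. Fig. 1 illustrates one such example of link functions, where $\alpha = 10$ and $K = 5$. Furthermore, define the average Hellinger distance and KL divergence for entries of two matrices $P, Q \in \mathbb{R}^{d_1 \times d_2}$ as:

$$d_H^2\big(f(P), f(Q)\big) \triangleq \frac{1}{d_1 d_2} \sum_{i,j} d_H^2\big(f(P_{ij}), f(Q_{ij})\big), \qquad D\big(f(P) \,\|\, f(Q)\big) \triangleq \frac{1}{d_1 d_2} \sum_{i,j} D\big(f(P_{ij}) \,\|\, f(Q_{ij})\big). \tag{7}$$

Fig. 1: $f_k(x)$, $k \in [K]$, with $\alpha = 10$ and $K = 5$.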


The following two lemmas are needed to prove the upper bound. To use the contraction principle in Lemma 1, we introduce a function

$$\bar{F}_{\Omega,Y}(X) = F_{\Omega,Y}(X) - F_{\Omega,Y}(0). \tag{8}$$

Lemma 1. Let $\bar{F}_{\Omega,Y}(X)$ be the likelihood function defined in (8) and $S$ be the set defined in (3). Then

$$P\Big\{ \sup_{X \in S} \big|\bar{F}_{\Omega,Y}(X) - E\bar{F}_{\Omega,Y}(X)\big| \ge C' K L_\alpha \alpha\sqrt{r} \cdot \Big(\sqrt{m(d_1 + d_2) + d_1 d_2 \log(d_1 d_2)}\Big) \Big\} \le \frac{C}{d_1 d_2}, \tag{9}$$

where $C'$ and $C$ are absolute positive constants and the probability and the expectation are both over $\Omega$ and $Y$.

Lemma 2. For $M_{ij}$ and $\hat{M}_{ij}$ both in $[-\alpha, \alpha]$, $\forall (i,j) \in [d_1] \times [d_2]$, we have

$$d_H^2\big(f(\hat{M}), f(M)\big) \ge \frac{\beta_\alpha^-}{4} \cdot \frac{\|\hat{M} - M\|_F^2}{d_1 d_2}.$$

Our main results are the upper bound for the average mean square error per entry in Theorem 1 and an information theoretic lower bound in Theorem 2.

Theorem 1 (Upper bound). Assume $M \in S$, and $\Omega$ is chosen at random following the binomial sampling model with $E[|\Omega|] = m$. Suppose that $Y$ is generated as in (1). Let $L_\alpha$ and $\beta_\alpha^-$ be as in (5) and (6). Let $\hat{M}$ be the solution to (4). Then with a probability exceeding $1 - C/(d_1 d_2)$, we have

$$\frac{1}{d_1 d_2}\|\hat{M} - M\|_F^2 \le C' \alpha \frac{K L_\alpha}{\beta_\alpha^-} \sqrt{\frac{r(d_1 + d_2)}{m}} \sqrt{1 + \frac{(d_1 + d_2)\log(d_1 d_2)}{m}}. \tag{10}$$

If $m \ge (d_1 + d_2)\log(d_1 d_2)$, then (10) simplifies to

$$\frac{1}{d_1 d_2}\|\hat{M} - M\|_F^2 \le \sqrt{2}\, C' \alpha \frac{K L_\alpha}{\beta_\alpha^-} \sqrt{\frac{r(d_1 + d_2)}{m}}. \tag{11}$$

Above, $C'$ and $C$ are absolute constants.

Remark 1. The ratio $L_\alpha/\beta_\alpha^-$ depends on the number of categories $K$ implicitly through the maximization of the smoothness of the functions $f_k$.

Remark 2. For a fixed $\alpha$, we can construct functions $f_k$ such that the ratio $K L_\alpha/\beta_\alpha^-$ is less than some absolute constant for any given $K$. In other words, we may be able to choose the link functions $f_k$ such that the upper bound is independent of the number of categories. Therefore, how to choose $f_k$ so that it satisfies the classification requirement and also minimizes this ratio becomes important. Examining the first inequality in (14), a good choice for $f_k$ should be such that for any $x, y \in [-\alpha, \alpha]$, $x \ne y$, there exists at most one $k \in [K]$ such that $f_k(x) = f_k(y)$. Fortunately, such $f_k$ is not hard to construct, and one such example is the multinomial logistic model, as demonstrated in Fig. 1.

Remark 3. Given $K$, $f_k$ and $\alpha$, the mean squared error per entry in (10) tends to 0 with probability 1 as the dimensions of the matrix $M$ go to infinity and $r = o(\log(d_1 d_2))$. In other words, one can recover $M$ accurately with a sufficiently large number of observations.
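To see how the upper bound behaves as the number of observations grows, the sketch below evaluates the right-hand side of the upper bound with and without the logarithmic factor. The absolute constant is unknown, so it is set to 1 purely as a placeholder, and the values of $L_\alpha$ and $\beta_\alpha^-$ are hypothetical inputs.

```python
import math

def upper_bound_full(alpha, K, L, beta_minus, r, d1, d2, m, C=1.0):
    """Rate with the factor sqrt(1 + (d1 + d2) * log(d1 * d2) / m); C is a placeholder constant."""
    lead = C * alpha * K * L / beta_minus * math.sqrt(r * (d1 + d2) / m)
    return lead * math.sqrt(1.0 + (d1 + d2) * math.log(d1 * d2) / m)

def upper_bound_simplified(alpha, K, L, beta_minus, r, d1, d2, m, C=1.0):
    """Simplified rate, valid when m >= (d1 + d2) * log(d1 * d2)."""
    return math.sqrt(2.0) * C * alpha * K * L / beta_minus * math.sqrt(r * (d1 + d2) / m)

params = dict(alpha=1.0, K=5, L=1.0, beta_minus=0.1, r=5, d1=100, d2=100)
for m in (2000, 5000, 10000):
    print(m, upper_bound_full(m=m, **params), upper_bound_simplified(m=m, **params))
```

Both expressions decay like $1/\sqrt{m}$, and once $m$ exceeds $(d_1 + d_2)\log(d_1 d_2)$ the simplified form dominates the full one.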

The following lemmas are used in proving the lower bound.

Lemma 3 (Lemma A.3 in [1]). Let $S$ be as in (3). Let $\gamma \le 1$ be such that $r/\gamma^2$ is an integer. Suppose $r/\gamma^2 \le d_1$; then we may construct a set $\chi \subset S$ of size

$$|\chi| \ge \exp\left(\frac{r d_2}{16\gamma^2}\right)$$

with the following properties: (1) for all $X \in \chi$, each entry has $|X_{ij}| = \alpha\gamma$; and (2) for all $X^{(i)}, X^{(j)} \in \chi$, $i \ne j$, $\|X^{(i)} - X^{(j)}\|_F^2 > \alpha^2\gamma^2 d_1 d_2/2$.

Lemma 4. Given $K$ categories, the KL divergence for two categorical probability distributions with parameters $(x_1, \ldots, x_K)$ and $(y_1, \ldots, y_K)$ is upper-bounded by

$$D\big((x_1, \ldots, x_K) \,\|\, (y_1, \ldots, y_K)\big) \le \sum_{k=1}^{K-1} \frac{(x_k - y_k)^2 + (x_k y_k - x_k^2)(1 - y_K) + (x_k y_k - y_k^2)(1 - x_K)}{y_k\big(1 - \sum_{i=1}^{K-1} y_i\big)}.$$
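The bound in Lemma 4 can be sanity-checked numerically on random categorical distributions. This is a small sketch; the helper names and the way the random distributions are generated are choices made for the example.

```python
import math
import random

def kl(x, y):
    """KL divergence between categorical distributions x and y."""
    return sum(xk * math.log(xk / yk) for xk, yk in zip(x, y) if xk > 0)

def lemma4_bound(x, y):
    """Right-hand side of the Lemma 4 bound; note 1 - sum_{i<K} y_i equals y_K."""
    K = len(x)
    total = 0.0
    for k in range(K - 1):
        numerator = ((x[k] - y[k]) ** 2
                     + (x[k] * y[k] - x[k] ** 2) * (1.0 - y[K - 1])
                     + (x[k] * y[k] - y[k] ** 2) * (1.0 - x[K - 1]))
        total += numerator / (y[k] * y[K - 1])
    return total

def random_distribution(K, rng):
    w = [rng.random() + 0.05 for _ in range(K)]   # keep entries away from zero
    s = sum(w)
    return [wk / s for wk in w]

rng = random.Random(1)
pairs = [(random_distribution(4, rng), random_distribution(4, rng)) for _ in range(200)]
violations = sum(1 for x, y in pairs if kl(x, y) > lemma4_bound(x, y) + 1e-12)
print(violations)
```

Expanding the numerator shows that the right-hand side equals $\sum_{k=1}^{K} x_k (x_k - y_k)/y_k$, so no violations are expected on any sample.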

The following theorem shows the existence of a worst-case scenario in which no algorithm can reconstruct the matrix with an arbitrarily small error:

Theorem 2 (Lower bound). Fix $\alpha$, $r$, $d_1$, and $d_2$ to be such that $\alpha, d_1, d_2 \ge 1$, $r \ge 4$ and $\alpha^2 r \max\{d_1, d_2\} \ge C_0$. Assume that each $f_k$ is decreasing and its derivative is increasing in $[\alpha - 1/4, \alpha]$ for all $k \in [K-1]$, and $f_K(\alpha - 1/4) \ge 1/2$. Let $\Omega$ be any subset of $[d_1] \times [d_2]$ with cardinality $m$. Let $Y$ be as in (1) and $\beta_\alpha^+$ be as in (6). Consider any algorithm which, for any $M \in S$, returns an estimator $\hat{M}$. Then there exists $M \in S$ such that with probability at least 3/4,

$$\frac{1}{d_1 d_2}\|M - \hat{M}\|_F^2 \ge \min\left\{ C_1,\ C_2 \frac{\alpha}{\sqrt{K\beta_\alpha^+}} \sqrt{\frac{r \max\{d_1, d_2\}}{m}} \right\} \tag{12}$$

as long as the right-hand side of (12) exceeds $r\alpha^2/\min\{d_1, d_2\}$, where $C_0$, $C_1$, $C_2$ are absolute constants.
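The lower-bound rate and its gap to the upper-bound rate can be tabulated in the same spirit. In the sketch below the absolute constants $C_1$ and $C_2$ are unknown and set to 1 as placeholders, and the scaling of the gap, $K^{3/2} L_\alpha \sqrt{\beta_\alpha^+}/\beta_\alpha^-$, is computed up to constants only; all numerical inputs are hypothetical.

```python
import math

def lower_bound(alpha, K, beta_plus, r, d1, d2, m, C1=1.0, C2=1.0):
    """min{C1, C2 * alpha / sqrt(K * beta_plus) * sqrt(r * max(d1, d2) / m)}; C1, C2 are placeholders."""
    rate = C2 * alpha / math.sqrt(K * beta_plus) * math.sqrt(r * max(d1, d2) / m)
    return min(C1, rate)

def gap_scaling(K, L, beta_minus, beta_plus):
    """Scaling of (upper bound) / (lower bound), up to absolute constants."""
    return K ** 1.5 * L * math.sqrt(beta_plus) / beta_minus

print(lower_bound(alpha=1.0, K=5, beta_plus=0.5, r=5, d1=100, d2=100, m=50000))
for K in (2, 4, 8):
    print(K, gap_scaling(K, L=1.0, beta_minus=0.1, beta_plus=0.5))
```

With $L_\alpha$, $\beta_\alpha^-$ and $\beta_\alpha^+$ held fixed, quadrupling $K$ multiplies the gap by eight, which is the $K^{3/2}$ dependence.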


Remark 4. The ratio between the upper bound and the lower bound is proportional to $K^{3/2} L_\alpha \sqrt{\beta_\alpha^+}/\beta_\alpha^-$. However, if we carefully construct $f_k$ as in Remark 2 so that $K L_\alpha/\beta_\alpha^-$ is less than some absolute constant, the gap between the upper and the lower bound can be reduced to a factor that is on the order of $O(\sqrt{K})$.

IV. Numerical examples

To test the performance of the regularized maximum likelihood estimator on real data, we consider the MovieLens dataset (which can be downloaded at http://www.grouplens.org). The dataset contains $10^5$ movie ratings from 942 users and can be viewed as a 942-by-1683 matrix whose entries take values in {1, 2, 3, 4, 5}. We first randomly select 5000 ratings to fit a multinomial logit regression model, and then solve the optimization problem (4) using another randomly selected 95,000 ratings as observed entries. Finally, we use the remaining 5000 ratings as test data and compute the average difference between the true ratings and the predicted ratings based on the recovered matrix, as shown in the first row of Table I. We compare the result with that obtained from conventional matrix completion by rounding the recovered entries to {1, 2, 3, 4, 5}. The results are shown in the second row of Table I.

The results show that our method performs better when the original rating is larger than 3, and has better overall performance. This can possibly be explained by the following reasoning. Our multinomial logistic model is fitted using the training data, and in the MovieLens dataset there are relatively few ratings with values 1, 2 and 3. Therefore, the fitted link functions for $k = 1, 2, 3$ are much less accurate than those for $k = 4$ and 5. Hence, categorical matrix completion performs better when the true rating is 4 or 5.

TABLE I: Average differences between true ratings $M_{ij}$ and recovered ratings $\hat{M}_{ij}$, when they are both in {1, 2, 3, 4, 5}.

Original Rating               1      2      3      4      5      Overall
Categorical                   1.537  0.958  0.461  0.489  0.986  0.708
Real-valued (conventional)    0.039  0.119  0.428  1.159  1.244  0.783

V. Discussions

We have studied a nuclear norm regularized maximum likelihood estimator for categorical matrix completion, and presented an upper bound and an information theoretic lower bound for our proposed estimator. Our upper and lower bounds meet up to a constant factor $O(K^{3/2})$, where $K$ is the fixed number of categories, and this factor can become $O(\sqrt{K})$ in some special cases. Our current formulation assumes that the input variables form a low-rank matrix and that each response is linked to only one corresponding input variable. Future extensions may include a formulation that allows more general link functions with multiple input variables and exploits a low-rank tensor structure.
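Proofs

Proof of Lemma 1. In order to prove the lemma, we let $\epsilon_{ij}$ be i.i.d. Rademacher random variables. In the following derivation, the first inequality uses the Rademacher symmetrization argument (Lemma 6.3 in [14]) and the second inequality is due to the power mean inequality: $\big(\sum_{i=1}^{K} a_i\big)^h \le K^{h-1}\sum_{i=1}^{K} a_i^h$ if $a_i > 0$, $\forall 1 \le i \le K$, and $h \ge 1$. Then we have

$$\begin{aligned}
E\Big[\sup_{X \in S} \big|\bar{F}_{\Omega,Y}(X) - E\bar{F}_{\Omega,Y}(X)\big|^h\Big]
&\le 2^h E\Big[\sup_{X \in S} \Big| \sum_{i,j} \epsilon_{ij}\, \mathbb{I}[(i,j) \in \Omega] \sum_{k=1}^{K} \mathbb{I}[Y_{ij} = a_k] \log\frac{f_k(X_{ij})}{f_k(0)} \Big|^h\Big] \\
&= 2^h E\Big[\sup_{X \in S} \Big| \sum_{k=1}^{K} \sum_{i,j} \epsilon_{ij}\, \mathbb{I}[(i,j) \in \Omega]\, \mathbb{I}[Y_{ij} = a_k] \log\frac{f_k(X_{ij})}{f_k(0)} \Big|^h\Big] \\
&\le 2^h K^{h-1} \sum_{k=1}^{K} E\Big[\sup_{X \in S} \Big| \sum_{i,j} \epsilon_{ij}\, \mathbb{I}[(i,j) \in \Omega]\, \mathbb{I}[Y_{ij} = a_k] \log\frac{f_k(X_{ij})}{f_k(0)} \Big|^h\Big] \\
&\le 2^h K^{h-1} \sum_{k=1}^{K} E\Big[\max_{i,j} \mathbb{I}[Y_{ij} = a_k]\Big]\, E\Big[\sup_{X \in S} \Big| \sum_{i,j} \epsilon_{ij}\, \mathbb{I}[(i,j) \in \Omega] \log\frac{f_k(X_{ij})}{f_k(0)} \Big|^h\Big],
\end{aligned} \tag{13}$$

where the expectations are over both $\Omega$ and $Y$.

In the following, we will use the contraction principle to further bound (13). By the definition of $L_\alpha^{(k)}$, we know that

$$\frac{1}{L_\alpha^{(k)}} \log\frac{f_k(x)}{f_k(0)}$$

are contractions that vanish at 0 for all $k = 1, 2, \ldots, K$. By Theorem 4.12 in [14] and using the fact that $|\langle A, B \rangle| \le \|A\| \|B\|_*$, we have

$$\begin{aligned}
E\Big[\sup_{X \in S} \big|\bar{F}_{\Omega,Y}(X) - E\bar{F}_{\Omega,Y}(X)\big|^h\Big]
&\le 2^h K^{h-1} \sum_{k=1}^{K} \big(2L_\alpha^{(k)}\big)^h E\Big[\sup_{X \in S} \big|\langle E \circ \Delta_\Omega, X \rangle\big|^h\Big] \\
&\le (4K)^h \Big(\max_{1 \le k \le K} L_\alpha^{(k)}\Big)^h E\Big[\sup_{X \in S} \|E \circ \Delta_\Omega\|^h \|X\|_*^h\Big] \\
&\le (4K)^h (L_\alpha)^h \big(\alpha\sqrt{r d_1 d_2}\big)^h E\big[\|E \circ \Delta_\Omega\|^h\big],
\end{aligned}$$

where $E$ denotes the matrix with entries given by $\epsilon_{ij}$, $\Delta_\Omega$ denotes the indicator matrix for $\Omega$, and $\circ$ denotes the Hadamard product.

To bound $E[\|E \circ \Delta_\Omega\|^h]$, we can use the result from [1] if we take $h = \log(d_1 d_2) \ge 1$:

$$E\big[\|E \circ \Delta_\Omega\|^h\big] \le C_0 \big(2(1 + \sqrt{6})\big)^h \left(\sqrt{\frac{m(d_1 + d_2) + d_1 d_2 \log(d_1 d_2)}{d_1 d_2}}\right)^h$$

for some constant $C_0$. Moreover, when $C' \ge 8(1 + \sqrt{6})e$,

$$C_0 \left(\frac{8(1 + \sqrt{6})}{C'}\right)^{\log(d_1 d_2)} \le \frac{C_0}{d_1 d_2}.$$

Therefore we can use Markov's inequality to see that

$$\begin{aligned}
&P\Big\{\sup_{X \in S} \big|\bar{F}_{\Omega,Y}(X) - E\bar{F}_{\Omega,Y}(X)\big| \ge C' K (\alpha\sqrt{r}) L_\alpha \cdot \Big(\sqrt{m(d_1 + d_2) + d_1 d_2 \log(d_1 d_2)}\Big)\Big\} \\
&\quad = P\Big\{\sup_{X \in S} \big|\bar{F}_{\Omega,Y}(X) - E\bar{F}_{\Omega,Y}(X)\big|^h \ge \Big(C' K (\alpha\sqrt{r}) L_\alpha \cdot \sqrt{m(d_1 + d_2) + d_1 d_2 \log(d_1 d_2)}\Big)^h\Big\} \\
&\quad \le \frac{E\big[\sup_{X \in S} |\bar{F}_{\Omega,Y}(X) - E\bar{F}_{\Omega,Y}(X)|^h\big]}{\Big(C' K (\alpha\sqrt{r}) L_\alpha \cdot \sqrt{m(d_1 + d_2) + d_1 d_2 \log(d_1 d_2)}\Big)^h} \le \frac{C}{d_1 d_2},
\end{aligned}$$

where $C' \ge 8(1 + \sqrt{6})e$ and $C$ are absolute constants. ■

Proof of Lemma 2. Assuming $x$ is any entry in $M$ and $y$ is any entry in $\hat{M}$, then $-\alpha \le x, y \le \alpha$, and by the mean value theorem there exists $\xi_k \in [x, y]$ for each $k \in [K]$ such that

$$\sqrt{f_k(x)} - \sqrt{f_k(y)} = \frac{f_k'(\xi_k)}{2\sqrt{f_k(\xi_k)}}(x - y).$$

By the assumption on $f$, there exists at least one $k \in [K]$ such that $f_k'(\xi_k) \ne 0$. Then

$$\begin{aligned}
d_H^2\big((f_1(x), \ldots, f_K(x)), (f_1(y), \ldots, f_K(y))\big)
&= \sum_{k=1}^{K} \Big(\sqrt{f_k(x)} - \sqrt{f_k(y)}\Big)^2 = \frac{(x - y)^2}{4} \sum_{k=1}^{K} \frac{(f_k'(\xi_k))^2}{f_k(\xi_k)} \\
&\ge \frac{(x - y)^2}{4} \max_{1 \le k \le K} \frac{(f_k'(\xi_k))^2}{f_k(\xi_k)} \\
&\ge \frac{(x - y)^2}{4} \left(\inf_{|\xi| \le \alpha} \max_{1 \le k \le K} \frac{(f_k'(\xi))^2}{f_k(\xi)}\right) \ge \frac{\beta_\alpha^-}{4}(x - y)^2.
\end{aligned} \tag{14}$$

Then the lemma is proven by summing across all entries and dividing by $d_1 d_2$. ■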


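Proof of Lemma 4. Let $z_k = y_k - x_k$ for each $k \in [K]$; then

$$D\big((x_1, \ldots, x_K) \,\|\, (y_1, \ldots, y_K)\big) = D\big((x_1, \ldots, x_K) \,\|\, (x_1 + z_1, \ldots, x_K + z_K)\big) = \sum_{k=1}^{K} x_k \log\frac{x_k}{x_k + z_k}.$$

And then we have for each $k \in [K]$

$$\frac{\partial D\big((x_1, \ldots, x_K) \,\|\, (x_1 + z_1, \ldots, x_K + z_K)\big)}{\partial z_k} = -\frac{x_k}{x_k + z_k}.$$

By the mean value theorem, we have

$$D\big((x_1, \ldots, x_K) \,\|\, (x_1 + z_1, \ldots, x_K + z_K)\big) = -\sum_{k=1}^{K} \frac{x_k z_k}{x_k + c z_k} \tag{15}$$

for some $c \in [0, 1]$. Since for each $k \in [K]$

$$\left(-\frac{x_k z_k}{x_k + c z_k}\right)' = \frac{x_k z_k^2}{(x_k + c z_k)^2} \ge 0,$$

the right-hand side of (15) is an increasing function in $c$, and hence

$$D\big((x_1, \ldots, x_K) \,\|\, (y_1, \ldots, y_K)\big) \le \sum_{k=1}^{K} \frac{x_k}{y_k}(x_k - y_k).$$

Noting that $x_K = 1 - \sum_{i=1}^{K-1} x_i$ and $y_K = 1 - \sum_{i=1}^{K-1} y_i$, we have

$$\begin{aligned}
\sum_{k=1}^{K} \frac{x_k}{y_k}(x_k - y_k)
&= \sum_{k=1}^{K-1} \frac{x_k}{y_k}(x_k - y_k) - \frac{1 - \sum_{i=1}^{K-1} x_i}{1 - \sum_{i=1}^{K-1} y_i} \sum_{k=1}^{K-1} (x_k - y_k) \\
&= \sum_{k=1}^{K-1} \frac{(x_k - y_k)^2 + x_k y_k(2 - x_K - y_K) - x_k^2(1 - y_K) - y_k^2(1 - x_K)}{y_k\big(1 - \sum_{i=1}^{K-1} y_i\big)} \\
&= \sum_{k=1}^{K-1} \frac{(x_k - y_k)^2 + (x_k y_k - x_k^2)(1 - y_K) + (x_k y_k - y_k^2)(1 - x_K)}{y_k\big(1 - \sum_{i=1}^{K-1} y_i\big)},
\end{aligned} \tag{16}$$

where the last step uses the fact that $0 < x_k, y_k < 1$, $\forall k \in [K]$. ■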


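Proof of Theorem 1. First, note that

$$\bar{F}_{\Omega,Y}(X) - \bar{F}_{\Omega,Y}(M) = F_{\Omega,Y}(X) - F_{\Omega,Y}(M) = \sum_{(i,j) \in \Omega} \sum_{k=1}^{K} \mathbb{I}[Y_{ij} = a_k] \log\frac{f_k(X_{ij})}{f_k(M_{ij})}.$$

Then for any $X \in S$,

$$E\big[\bar{F}_{\Omega,Y}(X) - \bar{F}_{\Omega,Y}(M)\big] = \frac{m}{d_1 d_2} \sum_{i,j} \sum_{k=1}^{K} f_k(M_{ij}) \log\frac{f_k(X_{ij})}{f_k(M_{ij})} = -m D\big(f(M) \,\|\, f(X)\big). \tag{17}$$

For $M \in S$, we know $\hat{M} \in S$ and $F_{\Omega,Y}(\hat{M}) \ge F_{\Omega,Y}(M)$. Thus we write

$$\begin{aligned}
0 &\le F_{\Omega,Y}(\hat{M}) - F_{\Omega,Y}(M) = \bar{F}_{\Omega,Y}(\hat{M}) - \bar{F}_{\Omega,Y}(M) \\
&= \bar{F}_{\Omega,Y}(\hat{M}) + E\bar{F}_{\Omega,Y}(\hat{M}) - E\bar{F}_{\Omega,Y}(\hat{M}) + E\bar{F}_{\Omega,Y}(M) - E\bar{F}_{\Omega,Y}(M) - \bar{F}_{\Omega,Y}(M) \\
&\le E\big[\bar{F}_{\Omega,Y}(\hat{M}) - \bar{F}_{\Omega,Y}(M)\big] + \big|\bar{F}_{\Omega,Y}(\hat{M}) - E\bar{F}_{\Omega,Y}(\hat{M})\big| + \big|\bar{F}_{\Omega,Y}(M) - E\bar{F}_{\Omega,Y}(M)\big| \\
&\le -m D\big(f(M) \,\|\, f(\hat{M})\big) + 2\sup_{X \in S} \big|\bar{F}_{\Omega,Y}(X) - E\bar{F}_{\Omega,Y}(X)\big|.
\end{aligned}$$

Applying Lemma 1, we obtain that with probability at least $1 - C/(d_1 d_2)$,

$$0 \le -m D\big(f(M) \,\|\, f(\hat{M})\big) + 2C' K (\alpha\sqrt{r}) L_\alpha \cdot \Big(\sqrt{m(d_1 + d_2) + d_1 d_2 \log(d_1 d_2)}\Big).$$

After rearranging terms and applying the fact that $\sqrt{d_1 d_2} \le d_1 + d_2$, we obtain

$$D\big(f(M) \,\|\, f(\hat{M})\big) \le 2C' K (\alpha\sqrt{r}) L_\alpha \sqrt{\frac{d_1 + d_2}{m}} \sqrt{1 + \frac{(d_1 + d_2)\log(d_1 d_2)}{m}}. \tag{18}$$

Note that the KL divergence can be bounded below by the Hellinger distance (Chapter 3 in [15]): $d_H^2(x, y) \le D(x \| y)$. Thus from (18), we obtain

$$d_H^2\big(f(M), f(\hat{M})\big) \le 2C' K (\alpha\sqrt{r}) L_\alpha \sqrt{\frac{d_1 + d_2}{m}} \sqrt{1 + \frac{(d_1 + d_2)\log(d_1 d_2)}{m}}. \tag{19}$$

Finally, Theorem 1 is proved by applying Lemma 2. ■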

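Proof of Theorem 2.

We will prove by contradiction. Lemma 3 and Lemma 4 are used in the proof. Without loss of generality, assume $d_2 \ge d_1$. Choose $\epsilon > 0$ such that

$$\epsilon^2 = \min\left\{ \frac{1}{1024},\ C_2 \frac{\alpha}{\sqrt{K\beta_\alpha^+}} \sqrt{\frac{r d_2}{m}} \right\},$$

where $C_2$ is an absolute constant that will be specified later. First, choose $\gamma$ such that $r/\gamma^2$ is an integer and

$$\frac{4\sqrt{2}\epsilon}{\alpha} \le \gamma \le \frac{8\epsilon}{\alpha} \le \frac{1}{4\alpha}.$$

We may make such a choice because

$$\frac{\alpha^2 r}{64\epsilon^2} \le \frac{r}{\gamma^2} \le \frac{\alpha^2 r}{32\epsilon^2}$$

and

$$\frac{\alpha^2 r}{32\epsilon^2} - \frac{\alpha^2 r}{64\epsilon^2} = \frac{\alpha^2 r}{64\epsilon^2} > 4\alpha^2 r > 1.$$

Furthermore, since we have assumed that $\epsilon^2$ is larger than $C r\alpha^2/d_1$, $r/\gamma^2 \le d_1$ for an appropriate choice of $C$. Let $\chi'_{\alpha/2,\gamma}$ be the set defined in Lemma 3, by replacing $\alpha$ with $\alpha/2$ and with this choice of $\gamma$. Then we can construct a packing set $\chi$ of the same size as $\chi'_{\alpha/2,\gamma}$ by defining

$$\chi \triangleq \Big\{ X' + \alpha\Big(1 - \frac{\gamma}{2}\Big)\mathbf{1}_{d_1 \times d_2} : X' \in \chi'_{\alpha/2,\gamma} \Big\}.$$

The distance between pairs of elements in $\chi$ is bounded since

$$\|X^{(i)} - X^{(j)}\|_F^2 \ge \frac{\alpha^2}{4} \cdot \frac{\gamma^2}{2}\, d_1 d_2 \ge 4 d_1 d_2 \epsilon^2. \tag{20}$$

Define $\alpha' = (1 - \gamma)\alpha$; then every entry of $X \in \chi$ has $X_{ij} \in \{\alpha, \alpha'\}$. Since we have assumed $r \ge 4$, for every $X \in \chi$ we have

$$\|X\|_* = \Big\|X' + \alpha\Big(1 - \frac{\gamma}{2}\Big)\mathbf{1}_{d_1 \times d_2}\Big\|_* \le \|X'\|_* + \alpha\Big(1 - \frac{\gamma}{2}\Big)\sqrt{d_1 d_2} \le \frac{\alpha}{2}\sqrt{r d_1 d_2} + \alpha\sqrt{d_1 d_2} \le \alpha\sqrt{r d_1 d_2},$$

for some $X' \in \chi'_{\alpha/2,\gamma}$. Since the $\gamma$ we choose is less than 1/2, $\alpha'$ is greater than $\alpha/2$. Therefore every entry of $X$ lies in $[-\alpha, \alpha]$, and we conclude that $\chi \subset S$.

Now consider an algorithm that for any $X \in S$ returns $\hat{X}$ such that

$$\frac{1}{d_1 d_2}\|X - \hat{X}\|_F^2 < \epsilon^2 \tag{21}$$

with probability at least 1/4. Next, we will show this leads to a contradiction. Let

$$X^* = \arg\min_{X^{(i)} \in \chi} \|X^{(i)} - \hat{X}\|_F^2.$$

By the same argument as that in [1], we have $X^* = X$ as long as (21) holds. Using the assumption that (21) holds with probability at least 1/4, we have

$$P(X^* \ne X) \le \frac{3}{4}. \tag{22}$$

Using a generalized Fano's inequality for the KL divergence in [?], we have

$$P(X \ne X^*) \ge 1 - \frac{\max_{X^{(k)} \ne X^{(l)}} D\big(Y_\Omega | X^{(k)} \,\|\, Y_\Omega | X^{(l)}\big) + 1}{\log|\chi|}. \tag{23}$$

Define

$$D \triangleq D\big(Y_\Omega | X^{(k)} \,\|\, Y_\Omega | X^{(l)}\big) = \sum_{(i,j) \in \Omega} D\big(Y_{ij} | X_{ij}^{(k)} \,\|\, Y_{ij} | X_{ij}^{(l)}\big).$$

Because the entries of $X^{(k)}$ and $X^{(l)}$ are $\alpha$ or $\alpha'$, from Lemma 4 we have

$$\begin{aligned}
D \le{}& m \sum_{k=1}^{K-1} \frac{(f_k(\alpha) - f_k(\alpha'))^2}{\min\{f_k(\alpha) f_K(\alpha), f_k(\alpha') f_K(\alpha')\}}
+ m \sum_{k=1}^{K-1} \frac{(f_k(\alpha) f_k(\alpha') - f_k^2(\alpha))(1 - f_K(\alpha'))}{\min\{f_k(\alpha) f_K(\alpha), f_k(\alpha') f_K(\alpha')\}} \\
&+ m \sum_{k=1}^{K-1} \frac{(f_k(\alpha) f_k(\alpha') - f_k^2(\alpha'))(1 - f_K(\alpha))}{\min\{f_k(\alpha) f_K(\alpha), f_k(\alpha') f_K(\alpha')\}}.
\end{aligned} \tag{24}$$

Considering the assumptions on $f_k$ in the theorem for all $k \in [K]$, we can further bound $D$ as follows:

$$\begin{aligned}
D &\le m \sum_{k=1}^{K-1} \frac{(f_k'(\alpha))^2 (\alpha - \alpha')^2}{f_k(\alpha) f_K(\alpha')} + m \sum_{k=1}^{K-1} \frac{(f_k(\alpha') - f_k(\alpha))(1 - f_K(\alpha'))}{f_K(\alpha')} \\
&\le m(K - 1)\beta_\alpha^+ \frac{(\gamma\alpha)^2}{f_K(\alpha')} + m\frac{(1 - f_K(\alpha'))^2}{f_K(\alpha')} \\
&\le 2mK\beta_\alpha^+ (\gamma\alpha)^2 + \frac{m}{2},
\end{aligned} \tag{25}$$

where we use the mean value theorem and the facts that $f_k'(\alpha) > f_k'(\alpha')$, $f_k(\alpha) < f_k(\alpha')$ and $f_K(\alpha) > f_K(\alpha')$ in the first inequality, and the fact that $f_K(\alpha') \ge 1/2$ in the third inequality. Combining (22) and (23), we have that

$$\frac{1}{4} \le 1 - P(X \ne X^*) \le \frac{D + 1}{\log|\chi|} \le 16\gamma^2 \frac{D + 1}{r d_2} \le 1024\epsilon^2 \left(\frac{128 m K \beta_\alpha^+ \epsilon^2 + \frac{m}{2} + 1}{\alpha^2 r d_2}\right). \tag{26}$$

Suppose $128 m K \beta_\alpha^+ \epsilon^2 + \frac{m}{2} \le 1$; then with (26), we have

$$\frac{1}{4} \le 1024\epsilon^2 \cdot \frac{2}{\alpha^2 r d_2},$$

which implies that $\alpha^2 r d_2 \le 32$. Then if we set $C_0 > 32$, this leads to a contradiction. Next, suppose $128 m K \beta_\alpha^+ \epsilon^2 + \frac{m}{2} > 1$; then with (26), we have

$$\frac{1}{4} < 1024\epsilon^2 \left(\frac{256 m K \beta_\alpha^+ \epsilon^2 + m}{\alpha^2 r d_2}\right),$$

thus

$$\epsilon^2 > \frac{-1 + \sqrt{1 + \frac{\alpha^2 r d_2 K \beta_\alpha^+}{4m}}}{512 K \beta_\alpha^+}.$$

By using the fact that $\sqrt{a^2 + b^2} \ge (a + b)/\sqrt{2}$ for any $a, b \ge 0$, we have

$$\epsilon^2 \ge \frac{\alpha}{1024\sqrt{2}\sqrt{K\beta_\alpha^+}} \sqrt{\frac{r d_2}{m}}.$$

Setting $C_2 \le 1/(1024\sqrt{2})$, this leads to a contradiction. Therefore, (21) must be incorrect with probability at least 3/4. This concludes our proof. ■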